Whole genome sequencing identifies structural variants contributing to hematologic traits in the NHLBI TOPMed program

Genome-wide association studies have identified thousands of single nucleotide variants and small indels that contribute to variation in hematologic traits. While structural variants are known to cause rare blood or hematopoietic disorders, the genome-wide contribution of structural variants to quantitative blood cell trait variation is unknown. Here we utilized whole genome sequencing data in ancestrally diverse participants of the NHLBI Trans Omics for Precision Medicine program (N = 50,675) to detect structural variants associated with hematologic traits. Using single variant tests, we assessed the association of common and rare structural variants with red cell-, white cell-, and platelet-related quantitative traits and observed 21 independent signals (12 common and 9 rare) reaching genome-wide significance. The majority of these associations (N = 18) replicated in independent datasets. In genome-editing experiments, we provide evidence that a deletion associated with lower monocyte counts leads to disruption of an S1PR3 monocyte enhancer and decreased S1PR3 expression.


WHI
The Women's Health Initiative (WHI) is a long-term, prospective, multi-center cohort study that investigates post-menopausal women's health. WHI was funded by the National Institutes of Health and the National Heart, Lung, and Blood Institute to study strategies to prevent heart disease, breast cancer, colon cancer, and osteoporotic fractures in women 50-79 years of age. WHI involves 161,808 women recruited between 1993 and 1998 at 40 centers across the US. The study consists of two parts: the WHI Clinical Trial, which was a randomized clinical trial of hormone therapy, dietary modification, and calcium/Vitamin D supplementation, and the WHI Observational Study, which focused on many of the inequities in women's health research and provided practical information about the incidence, risk factors, and interventions related to heart disease, cancer, and osteoporotic fractures. All WHI participants provided informed consent and the study was approved by the Institutional Review Board (IRB) of the Fred Hutchinson Cancer Research Center.

deCODE
All participants who donated samples gave informed consent, and the National Bioethics Committee of Iceland approved the study, which was in agreement with conditions issued by the Data Protection Authority of Iceland (VSN_140-015). Personal identities of the participants' data and biological samples were encrypted by a third-party system (Identity Protection System), approved and monitored by the Data Protection Authority.

Genetic Ancestry and Relatedness
Principal components (PCs) of genetic ancestry and pairwise relatedness measures were estimated for all 140,062 samples included in the TOPMed 'Freeze 8' data release. Autosomal genetic variants passing the quality filter with a MAF > 0.01 and missing call rate < 0.01 were LD-pruned with an r^2 threshold of 0.1 to obtain a set of 638,486 effectively independent variants for genetic ancestry and relatedness estimation. PC-AiR was used to obtain ancestry-informative PCs robust to familial relatedness; the first 11 PCs showed evidence of population structure. PC-Relate was then used to estimate pairwise kinship coefficients (KCs) for all pairs of samples, conditional on the genetic ancestry captured by PC-AiR PCs 1-11; these KC estimates reflect only recent genetic relatedness, e.g. due to pedigree structure. The PC-Relate KC estimates were used to construct a 4th-degree sparse, block-diagonal, empirical kinship matrix (KM) for association testing: any pair of samples with estimated KC > 2^(-11/2) ~ 0.022 was clustered in the same block; all KC estimates within a block of samples were kept, regardless of value; and all KC estimates between blocks were set to 0. By using a sparse block-diagonal KM, the association tests are more computationally efficient, yet recent genetic relatedness is still accounted for. We subset the freeze-wide PCs and sparse KM to the appropriate set of participants for each analysis.
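The block construction rule described above can be illustrated with a minimal sketch. This is not the pipeline used in the study; it is a plain-Python illustration that assumes pairwise KC estimates are already available as a dictionary keyed by sample pairs. The function name `sparse_block_kinship`, the sample IDs, and the KC values are all hypothetical.

```python
from collections import defaultdict

# 4th-degree kinship threshold quoted in the text: 2^(-11/2) ~ 0.022
KC_THRESHOLD = 2 ** (-11 / 2)


def sparse_block_kinship(samples, kc):
    """Cluster samples into blocks and sparsify pairwise kinship estimates.

    samples: list of sample IDs (hypothetical identifiers).
    kc: dict mapping (sample_a, sample_b) tuples of distinct samples
        to an estimated kinship coefficient.
    Returns (blocks, sparse_kc).
    """
    parent = {s: s for s in samples}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Step 1: any pair with KC above the threshold joins the same block.
    for (a, b), value in kc.items():
        if value > KC_THRESHOLD:
            parent[find(a)] = find(b)

    blocks = defaultdict(set)
    for s in samples:
        blocks[find(s)].add(s)

    # Step 2: keep every within-block estimate regardless of its value;
    # between-block estimates are dropped, i.e. treated as 0.
    sparse_kc = {
        (a, b): value
        for (a, b), value in kc.items()
        if find(a) == find(b)
    }
    return sorted(sorted(b) for b in blocks.values()), sparse_kc


# Toy example with made-up kinship values.
samples = ["s1", "s2", "s3", "s4"]
kc = {
    ("s1", "s2"): 0.25,   # above threshold: same block
    ("s2", "s3"): 0.03,   # above threshold: s3 joins the block
    ("s1", "s3"): 0.01,   # below threshold, but within the block: kept
    ("s3", "s4"): 0.005,  # below threshold, between blocks: set to 0
}
blocks, sparse_kc = sparse_block_kinship(samples, kc)
print(blocks)
print(sparse_kc)
```

In this toy example, s1, s2 and s3 form one block because they are linked through pairs above the threshold, so the s1-s3 estimate is retained even though it is below the threshold, while the s3-s4 estimate is dropped.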

Classification of race/ethnicity groups using HARE
Ancestry groups were based on a combination of participants' reported race/ethnicity and genetic ancestry represented by PCs from PC-AiR. To infer race/population group membership for participants with missing values, we used the HARE method, a machine learning algorithm that uses a support vector machine (SVM) to determine stratum assignment, taking as input genetically estimated PC values and reported race/ethnicity for each participant. Strata are defined by the unique reported race/ethnicity values provided; the HARE SVM then uses the input (training) data to learn the probability of stratum membership across the entire PC space. The output of HARE consists of multinomial probability vectors of stratum membership for each participant. HARE was run on a subset of samples included in the TOPMed Freeze 8 data release; specifically, samples for participants from non-US-based studies and the Amish participants (because they were very distinct in PC space) were excluded from the HARE analysis. HARE was run using the first 9 PC-AiR PCs generated on this subset of samples to represent genetic ancestry with the following reported race/population groups: Asian, Black, Central American, Cuban, Dominican, Mexican, Puerto Rican, South American, and White. The genetic data from the 31,918 participants with either unreported or non-specific (e.g. 'Multiple' or 'Other') race and population membership were included in the HARE analysis, but they were not used to train the SVM. These participants were assigned to a population stratum based on their highest HARE output probability of membership. All other participants remained in the population stratum corresponding to their reported race/population group. Amish participants were assigned to their own stratum.

Visualization of WGS reads in single representative samples for conditionally-independent trait-associated SVs.
Panels show WGS read visualization for a single representative sample predicted by Parliament 2 to exhibit an SV event in the A) 16p13.3 (alpha-globin) locus, B) 17p11.2 (KCNJ18) locus, C) 2p11.2 locus and D) 14q32.33 locus. WGS reads were visualized using SAMPLOT and display coverage and paired-end read evidence supporting SV events.

Figure S6. Comparison of trait-association SV and SNV p-values in TOPMed for previously-reported SNVs from a European ancestry sample set 10. The x-axis shows the p-value for the SNV in the TOPMed sample, and the y-axis shows the p-value for the SV that tags that SNV in the TOPMed sample. Only SNVs with an SV in LD at r^2 > 0.8 are shown. Points with p < 1e-4 in either analysis are labeled with the trait. All p-values are derived from two-sided t-tests and are not adjusted for multiple comparisons.

The observed and expected chromatin contact frequencies (or counts) are represented by the black and red lines, respectively. The left y-axis displays the range of chromatin contact frequency. The statistical significance (-log10(P-value)) of each long-range chromatin interaction is represented by the blue line, with its range listed on the right y-axis. All p-values are derived from two-sided t-tests and are not adjusted for multiple comparisons. The cell line- or tissue-specific FDR threshold (5%) is shown as a cyan horizontal dashed line, and the more stringent Bonferroni threshold (P = 0.05) is shown as a blue horizontal dashed line.

Figure S8. Epigenome editing implicates S1PR3 in the 9q22.1 monocyte count association (Related to Figure 2).
(A-B) eQTL analysis between the 9q22.1 SV and genes within a 1 Mb window in peripheral blood mononuclear cells (PBMCs) (A) and T-cells (B) using the ancestry-stratified datasets from the Multi-Ethnic Study of Atherosclerosis (MESA, including n = 297 African American and n = 246 Hispanic/Latino participants for PBMCs; n = 78 African American and n = 86 Hispanic/Latino participants for T-cells). AFHI: African American and Hispanic/Latino. All p-values are derived from two-sided t-tests and are not adjusted for multiple comparisons.
(D-E) Expression of genes within a 1 Mb window in THP-1 cells expressing dCas9-KRAB after transduction with sgRNA #2 (D) or sgRNA #3 (E) targeting the S1PR3 SV (orange) as compared to a neutral locus control sgRNA (blue). The relative mRNA level of each gene is represented as mean ± standard deviation (SD). N = 3 replicates, where each replicate is a unique cellular transduction by sgRNA cassette. Two-sided Student's t-test; P > 0.05 for all comparisons except where indicated (* P < 0.05). The locations of the sgRNAs used for CRISPRi are indicated in (C). P-values for (D) and (E) were 0.043 and 0.033, respectively.
(F) Expression of each gene within a 1 Mb window around the 9q22.1 locus in THP-1 cells relative to GAPDH. Genes with undetectable expression (CDK20, SPATA31C2, SPIN1, SHC3, GADD45G and UNQ6494) are not shown. N = 3 technical replicates. The relative mRNA level of each gene is represented as mean ± standard deviation (SD).

Figure S9. Human hematopoietic reconstitution of immunodeficient mice after S1PR3 editing (Related to Figure 2).
